SF Salaries

The purpose of this project is to demonstrate the pandas library data manipulation operations. In order to do that, we'll use San Francisco dataset available on (https://www.kaggle.com/kaggle/sf-salaries). This dataset contains annual data of employees from 2011 to 2014.

Importing package


In [3]:
import pandas as pd

Read CSV into dataframe 'sal'


In [4]:
sal = pd.read_csv('Salaries.csv')

Head of dataframe


In [5]:
sal.head()


Out[5]:
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.00 400184.25 NaN 567595.43 567595.43 2011 NaN San Francisco NaN
1 2 GARY JIMENEZ CAPTAIN III (POLICE DEPARTMENT) 155966.02 245131.88 137811.38 NaN 538909.28 538909.28 2011 NaN San Francisco NaN
2 3 ALBERT PARDINI CAPTAIN III (POLICE DEPARTMENT) 212739.13 106088.18 16452.60 NaN 335279.91 335279.91 2011 NaN San Francisco NaN
3 4 CHRISTOPHER CHONG WIRE ROPE CABLE MAINTENANCE MECHANIC 77916.00 56120.71 198306.90 NaN 332343.61 332343.61 2011 NaN San Francisco NaN
4 5 PATRICK GARDNER DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT) 134401.60 9737.00 182234.59 NaN 326373.19 326373.19 2011 NaN San Francisco NaN

Finding some basic info about the data through the info() function


In [6]:
sal.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148654 entries, 0 to 148653
Data columns (total 13 columns):
Id                  148654 non-null int64
EmployeeName        148654 non-null object
JobTitle            148654 non-null object
BasePay             148045 non-null float64
OvertimePay         148650 non-null float64
OtherPay            148650 non-null float64
Benefits            112491 non-null float64
TotalPay            148654 non-null float64
TotalPayBenefits    148654 non-null float64
Year                148654 non-null int64
Notes               0 non-null float64
Agency              148654 non-null object
Status              0 non-null float64
dtypes: float64(8), int64(2), object(3)
memory usage: 14.7+ MB

Average BasePay


In [7]:
sal['BasePay'].mean()


Out[7]:
66325.44884050643

Highest amount of overtime pay in the dataset


In [8]:
sal['OvertimePay'].max()


Out[8]:
245131.88

What is the job title of JOSEPH DRISCOLL?


In [9]:
sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['JobTitle']


Out[9]:
24    CAPTAIN, FIRE SUPPRESSION
Name: JobTitle, dtype: object

How much does JOSEPH DRISCOLL make (including benefits)?


In [10]:
sal[sal['EmployeeName'] == 'JOSEPH DRISCOLL']['TotalPayBenefits']


Out[10]:
24    270324.91
Name: TotalPayBenefits, dtype: float64

The name of highest paid person (including benefits)?


In [11]:
sal[sal['TotalPayBenefits'] == sal['TotalPayBenefits'].max()]


Out[11]:
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
0 1 NATHANIEL FORD GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY 167411.18 0.0 400184.25 NaN 567595.43 567595.43 2011 NaN San Francisco NaN

The name of lowest paid person


In [12]:
sal[sal['TotalPayBenefits']  == sal['TotalPayBenefits'].min()]


Out[12]:
Id EmployeeName JobTitle BasePay OvertimePay OtherPay Benefits TotalPay TotalPayBenefits Year Notes Agency Status
148653 148654 Joe Lopez Counselor, Log Cabin Ranch 0.0 0.0 -618.13 0.0 -618.13 -618.13 2014 NaN San Francisco NaN

The average (mean) BasePay of all employees per year? (2011-2014)?


In [13]:
sal.groupby('Year').mean()['BasePay']


Out[13]:
Year
2011    63595.956517
2012    65436.406857
2013    69630.030216
2014    66564.421924
Name: BasePay, dtype: float64

Number of unique job titles


In [14]:
sal['JobTitle'].nunique()


Out[14]:
2159

Top 5 most common jobs


In [15]:
grouped = sal.groupby('JobTitle').count()
top5 = grouped.sort_values(by='Id',  ascending=False)[:5]
top5['Id']


Out[15]:
JobTitle
Transit Operator                7036
Special Nurse                   4389
Registered Nurse                3736
Public Svc Aide-Public Works    2518
Police Officer 3                2421
Name: Id, dtype: int64

Job titles represented by only one person


In [16]:
copied_sal = sal[sal['Year'] == 2013]
group = copied_sal.groupby('JobTitle').count()
count = group[group['Id'] == 1]
count.count()['Id']


Out[16]:
202

How many people have the word Chief in their job title?


In [17]:
def find_chief(job_title):    
    if 'chief' in job_title.lower().split():
        return True
    else:
        return False

sal = pd.read_csv('Salaries.csv')

sum(sal['JobTitle'].apply(lambda x: find_chief(x)))


Out[17]:
477

Is there a correlation between length of the Job Title string and Salary?


In [18]:
sal['title_len'] = sal['JobTitle'].apply(len)

sal[['title_len', 'TotalPayBenefits']].corr()


Out[18]:
title_len TotalPayBenefits
title_len 1.000000 -0.036878
TotalPayBenefits -0.036878 1.000000